Thesis Stereotyping the Web: Genre Classification of Web Documents
Author
Elizabeth Sugar Boese, Department of Computer Science, Colorado State University, Fort Collins, CO 80523 (Spring 2005)
Abstract
Retrieving relevant documents over the Web is a difficult task. Currently, search engines rely on keywords to match documents to user queries. This thesis explores the potential for discriminating between documents based on their genre. I define genre as a taxonomy, orthogonal to topic, that incorporates the style, form, and content of a document and allows fuzzy classification into multiple genres. I explore how to automate the classification of Web documents according to their genres. Over 1,600 genre features are identified, and feature selection methods are examined for distinguishing documents among ten genre types. Classification of documents using a Bayes Net on a subset of 75 features achieved 90% accuracy.
Similar resources
Genre Classification of Web Documents
Retrieving relevant documents over the Web is an overwhelming task when search engines return thousands of Web documents. Sifting through these documents is time-consuming and sometimes leads to an unsuccessful search. One problem is that most search engines rely on matching a query to documents based solely on topical keywords. However, many users of search engines have a particular genre in m...
Genre Classification of Web Pages
Genre classification means to discriminate between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents. While most existing investigations of automated genre classification are based on corpora of news articles, the idea here is applied to arbitrary Web pages. W...
Multiple sets of features for automatic genre classification of web documents
With the increase of information on the Web, it is difficult to find desired information quickly out of the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has been focused on a subject or a topic of a document. A genre or a style is another view of a document different from a subject ...
The Impact of Noise in Web Genre Identification
Genre detection of web documents fits an open-set classification task. Web documents that do not belong to any predefined genre, or in which multiple genres co-exist, are considered noise. In this work we study the impact of noise on automated genre identification within an open-set classification framework. We examine alternative classification models and document representation schemes based on t...
Effectiveness of web search results for genre and sentiment classification
The motivation of this study is to enhance general topical search with a sentiment-based one where the search results (called snippets) returned by the Web search engine are clustered by sentiment categories. Firstly we developed an automatic method to identify product review documents using the snippets (summary information that includes the URL, title, and summary text), which is considered a...